perm filename RESULT[DIS,DBL]5 blob
sn#219335 filedate 1976-06-12 generic text, type C, neo UTF8
COMMENT ⊗ VALID 00011 PAGES
C REC PAGE DESCRIPTION
C00001 00001
C00002 00002 .NSECP(Results)
C00004 00003 .SSEC(What AM Did)
C00006 00004 . SSSEC(AM as a Mathematician)
C00007 00005 . SSSEC(AM as a Computer Program)
C00011 00006 .SSEC(Experiments with AM)
C00021 00007 . SSSEC(Must the Worth numbers be finely tuned?)
C00031 00008 . SSSEC(How finely tuned is the Agenda?)
C00040 00009 . SSSEC(What if certain key concepts are eliminated?)
C00042 00010 . SSSEC(What if certain heuristics are tampered with?)
C00043 00011 . SSSEC(Can AM work in a new domainα: Plane Geometry?)
C00045 ENDMK
C⊗;
.NSECP(Results)
.R1PAGE: PAGE;
This chapter opens by summarizing what AM "did". Section 1 gives a
fairly high-level description of the major paths which were explored,
the concepts discovered along the way, the relationships which were
noticed, and the ones which "should" have been noticed but weren't.
The next section continues this exposition by presenting the results
of experiments which were done with (and ⊗4on⊗*) AM.
.ONCE TURN ON "{}"
Chapter {[2] EVALU} will draw upon these results -- and others given
in the appendices -- to form conclusions about AM. Several meta-level
questions will be tackled there (e.g., What are AM's limitations?).
.SSEC(What AM Did)
<<This contains duplicate info. to that given in Chap. 1: AM as mathematician>
AM is both a mathematician of sorts, and a big computer program.
.BEGIN TURN ON "{}"
By granting AM more anthropomorphic qualities than it deserves, we
can describe its progress through elementary mathematics. It
rediscovered many well-known concepts, a couple of interesting but
not-generally-known ones, and several concepts which were hitherto
unknown and should have stayed that way. Section
{SECNUM}.{SSECNUM}.1 recaps what AM did, much as a historian might
critically evaluate Euler's work.
By under-estimating AM's sophistication, one can demand answers to
the typical questions asked about a computer program: how big is it,
how much cpu time does it use, what is the input/output, etc. These
are treated in Section {SECNUM}.{SSECNUM}.2.
.END
. SSSEC(AM as a Mathematician)
<<See p.11 OVERV -- chap 1 -- for updated version. Why repeat?>
. SSSEC(AM as a Computer Program)
<<Is this section worth including? Change of format, perhaps?>
Viewed as a large LISP program, AM holds very little of interest. There
is the usual battery of customized functions (e.g., a conditional
PRINT function), the storage hacks (special emergency garbage collection
routines, which know which facets are expendable), the time hacks
(clauses in a conjunction are omnisciently arranged so that the one
most likely to fail comes first), and the bugs (if the user renames a
concept while it is the one currently being worked on,
there is a 5% chance of AM entering an infinite loop).
Below are listed a few parameters of the system, although I doubt that
they hold any theoretical significance. The reader may be curious about how
big AM is, how long it takes to run, etc.
Machine: SUMEX, PDP-10, KI-10 uniprocessor, 256k core memory.
Language: Interlisp, January '75 release,
which occupies 140k, but which provides a surplus "shadow space"
of 256k additional words available for holding compiled code.
AM support code:
200 compiled (not block-compiled) utility routines, control routines, etc.
Occupy roughly 100k, but all are pushed into the shadow space.
AM itself: 115 concepts, each occupying about 1k. Facet/entries are stored as
property/value pairs on the property lists of atoms whose names are the concepts' names.
Snazzy feature:
Executable entries on facets (e.g., an entry on Union.Alg) are stored uncompiled
until the first time they are actually called on, at which time they are compiled
and then executed.
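This lazy-compilation scheme can be sketched in a modern language. The sketch below is a hypothetical reconstruction in Python, not AM's actual Interlisp mechanism; the `FacetEntry` class and the Union.Alg stand-in are illustrative names only:

```python
# Hypothetical reconstruction of AM's lazy compilation of facet entries:
# an executable entry (e.g., an entry on Union.Alg) is kept as source
# text until its first invocation, at which point it is compiled once,
# cached, and thereafter run directly.

class FacetEntry:
    def __init__(self, source):
        self.source = source      # uncompiled definition, stored as text
        self.compiled = None      # filled in on first call

    def __call__(self, *args):
        if self.compiled is None:                  # first invocation only
            code = compile(self.source, "<facet>", "eval")
            self.compiled = eval(code)             # build the function once
        return self.compiled(*args)

# Example: a stand-in for an algorithm entry on Union.Alg.
union_alg = FacetEntry("lambda s1, s2: sorted(set(s1) | set(s2))")
```

The point of the scheme is that entries which are never called never pay the cost of compilation, while frequently-used entries pay it exactly once.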
"Final" state of AM: 300 concepts, each occupying about 1k. Many are swapped out
onto disk. Original 115 concepts have grown to an average size of 2k.
Cpu time (and number of tasks chosen) at the moment various discoveries were made:
.BEGIN FILL PREFACE 0 INDENT 10 TURN ON "\" TABS 30,45
Cardinality:\15 minutes;\40 tasks
Unique factorization:\25 minutes;\60 tasks
<Maybe give a few more, intermediate ones>
.E
<<Other details about AM as a computer pgm>
.SKIP 1
.SSEC(Experiments with AM)
.EXPT: SECNUM;
.EXPTSSEC: SSECNUM;
.EXPTPAGE: PAGE;
We have now described the activities AM carried out during its best
run. AM worked by itself, each time executing the top task
on the agenda. It received no help from the user, and all its
concepts' Intuitions facets had been removed.
One valuable aspect of AM is that it is amenable to many kinds of
interesting experiments. Although AM is too ⊗4ad hoc⊗* for numerical
results to have much significance, the qualitative results perhaps do
have some valid things to say about research in elementary
mathematics, about automating research, and at least about the
efficacy of various parts of AM's design.
This section will explain what it means to perform an experiment on
AM, what kinds of experiments are imaginable, which of those are
feasible, and finally will describe the 5 experiments which were
performed on AM.
By modifying AM in various ways, its behavior can be altered, and the
⊗4quality⊗* of its behavior will change as well. As a drastic
example, if all the heuristics were removed, then AM would just "sit
there" -- with no plausible generators, its agenda would remain
empty.
By careful planning, each experiment can tell us something new about
AM: how valuable a certain piece of it is, how robust a certain
scheme really is, etc. The results of these experiments would then
have something to contribute to a discussion of the "real
intelligence" of AM (e.g., what features were superfluous), and
contribute to the design of the "next" AM-like system. Generalizing
from those results, one might suggest some hypotheses about the
larger task of automated math research.
Let's cover the different ⊗4kinds⊗* of experiments one could perform
on AM:
(i) Remove individual concept modules, and/or individual heuristic
rules. Then examine how AM's performance is degraded. AM should
operate even if most of its heuristic rules and most of its concept
modules were excised, so long as ⊗4any⊗* facets of any of the concepts
still exist. If the remaining fragment of AM is too small, however,
it may not be able to find anything interesting to do. In fact, this
situation was actually encountered experimentally, when the first few
partially complete concepts were inserted. If only a little bit of AM
is removed, the remainder will in fact keep operating without this
"uninteresting collapse". The converse situation should also hold:
although still functional with any concept module unplugged, AM's
performance ⊗4should⊗* be noticeably degraded. That is, while not
indispensable, each concept should nontrivially help the others. The
same holds for each individual heuristic rule. Which concepts does
AM now "miss" discovering? Is the removed concept/heuristic later
discovered anyway by those which are left in AM? This should
indicate the importance of each kind of concept and rule which AM
starts with.
(ii) Vary the relative weights given to features by the criteria
which judge aesthetics, interestingness, worth, utility, etc. See
how important each factor is in directing AM along successful routes.
In other words, vary the little numbers in the formulae (both the
global formulae and the local ones inside heuristic rules). One
important result will be some idea of the robustness or "toughness"
of the numeric weighting factors. If the system easily collapses, it
was too finely tuned to begin with.
(iii) Add several new concept modules (including new heuristics) and
see if AM can work in some unanticipated field of mathematics (like
graph theory or calculus or plane geometry). Do earlier achievements
-- new concepts and relationships involving concepts -- have any
impact in the new domain? Are some specialized heuristics from the
first domain totally wrong here? Do all the old general heuristics
still hold here? Are they sufficient, or are some "general"
heuristics needed here which weren't needed before? Does AM "slow
down" as more and more concepts get introduced?
(iv) Try to have AM
develop nonmathematical theories (like elementary physics, or program
verification). This might require limiting AM's freedom to
"ignore a given body of data and move on to something more
interesting". The exploration of very non-formalizable fields (e.g.,
politics) might require much more than a small augmentation of AM's
base of concepts.
(v) Add several new concepts dealing with proof, and of course add
all the associated heuristic rules. Such rules would advise AM on the
fine points of using various techniques of proof/disproof: when to
use them, what to try next based on why the last attempt failed, etc.
See if the ⊗4kinds⊗* of discoveries AM makes are increased.
During the few weeks
prior to the writing of this document,
several experiments (of types i, ii, and
iii above$$ experiments of type (iv) weren't tried and are left as
"open problems", as invitations for future research efforts.
Experiment (v) will probably occupy the summer of 1976. $) have been
set up and performed on AM. We're now ready to examine each of them
in detail. The following points are covered for each experiment:
.BN
λλ How was it thought of?
λλ What will be gained by it? What would be the implications of the
various possible outcomes?
λλ How was the experiment set up? What preparations/modifications had
to be made? How much time (man-hours) did it take?
λλ What happened? How did AM's behavior change? Was this expected?
Analysis.
λλ What was learned from this experiment? Can we conclude anything
which suggests new experiments (e.g., use a better machine, a new
domain) or which bears on a more general problem that AM faced (e.g.,
a new way to teach math, a new idea about doing math research)?
.E
. SSSEC(Must the Worth numbers be finely tuned?)
.EX3PAGE: PAGE;
Each of the 115 initial concepts
was supplied (by the author) with a number between 0
and 1000, stored as its Worth facet, which is supposed to represent
the overall value of the concept. "Compose" has a higher initial
Worth than "Structure-delete", which is higher than "Equality"$$ As AM
progresses, it notices something interesting about Equality every now
and then, and pushes its Worth value upwards. $.
Frequently, the priority of a task involving C depends on the overall
Worth of C.
How sensitive is
AM's behavior to the initial settings of the Worth facets? How
finely tuned must these initial Worth values be?
To test this, a simple experiment was performed. Just before starting
AM, the mean value of all concepts' Worth values was computed (around
200). Then each concept had its Worth reset to the value 200. This
was done "by hand", by the author, in a matter of seconds. AM was
then started and run as if there were nothing amiss, and its behavior
was watched carefully.
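The setup can be sketched as follows. This is a hypothetical modern reconstruction; the dictionary representation and sample concept names are illustrative, not AM's actual Interlisp property-list structures (the footnote below gives the one-line Interlisp form actually used):

```python
# Sketch of the experiment's setup: compute the mean of all concepts'
# Worth values, then reset every Worth facet to that single value.

def flatten_worths(concepts, value=None):
    """Reset every concept's Worth facet to `value` (defaulting to the
    rounded mean of the current Worths), and return the value used."""
    if value is None:
        value = round(sum(c["worth"] for c in concepts) / len(concepts))
    for c in concepts:
        c["worth"] = value
    return value

# Illustrative initial Worths (the real run used 115 concepts, mean ~200).
concepts = [{"name": "Compose",          "worth": 300},
            {"name": "Structure-delete", "worth": 250},
            {"name": "Equality",         "worth": 120}]
flatten_worths(concepts)   # every Worth becomes the mean (223 here)
```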
What happened? By and large, the same major discoveries were made --
and missed -- as usual, in the same order as usual. But whereas AM
proceeded fairly smoothly before, with little superfluous activity, it
proceeded fairly smoothly before, with little superflous activity, it
now wandered quite blindly for long periods of time, especially at
the very beginning. Once AM "hooked into" a line of productive
development, it followed it just as always, with no noticeable
additional wanderings. As one of these lines of development died
out, AM would wander around again, until the next one was begun.
It took roughly three times as long for each major discovery to occur
as normal. This "delay" got shorter and shorter as AM developed
further. In each case, the tasks preceding the discovery and
following it were pretty much the same as normal; only the tasks
"between" two periods of development were different -- and much more
numerous. The precise numbers involved would probably be more
misleading than helpful, so they will not be given$$ Any reader who
wishes to perform this experiment can simply say [MAPC CONCEPTS
'(LAMBDA (c) (SETB c WORTH 200] to Interlisp, just before typing
(START) to begin AM. $.
The reader may be interested to learn that the Worth values of many
of the concepts -- and most of the new concepts -- ended up very
close to the same values that they achieved in the original run.
Overrated concepts were investigated and proved boring; underrated
concepts had to wait longer for their chances, but then quickly proved
interesting and had their Worth facets boosted.
The conclusion I draw from this change in behavior is that the Worth
facets are useful for making blind decisions -- where AM must choose
based only on the overall worths of the various concepts in its
repertoire. Whenever a specific reason existed, it was far more
influential than the "erroneous" Worth values. The close, blind,
random decisions occur between long bursts of specific-reason-driven
periods of creative work.
The general answer, then, is ⊗4No⊗*, the initial settings of the Worth
values are not crucial. Guessing reasonable initial values for them is
merely a time-saving device.$$
This suggests
an interesting research problem: what impact does
the quality of initial starting values have on humans? Give several bright
undergraduate math majors the same set of objects and operators to play with,
but tell some of them (i) nothing, and some of them (ii) a certain few pieces of
the system are very promising, (iii)
emphasize a different subset of the objects and operators.
How does "misinformation" impede the humans? How about no information?
Have them give verbal protocols about where they are focussing their attention,
and why. $
Albeit at a nontrivial cost, the Worth facets did manage to correct
themselves by the end of a long$$ A couple of cpu hours, about a
thousand tasks total selected from the agenda $ run. What would
happen if the Worth facets of those 110 concepts were not only
initialized to 200, but were held fixed at 200 for the duration of
the run?
In this case, the delay factor still decreased. That is, AM
still got more and more "back to normal" as it progressed. The reason
is that AM's later work dealt with concepts like Primes,
Square-root, etc., which were so far removed from the initial base of
concepts that the initial concepts' Worths were of little consequence.
Even more drastically, we could force all the Worth facets of all
concepts -- even newly-created ones -- to be kept at the value 200
forever. In this case, AM's behavior doesn't completely disintegrate,
but that delay factor actually increases with time: apparently, AM
begins to suffer from the exponential growth of "things to do" as its
repertoire of concepts grows linearly. Its purposiveness and
directionality depend more and more on "focus of attention", and if
that feature is removed, AM loses much of its rationality. A factor
of 5 delay doesn't sound that bad "efficiency-wise", but the actual
apparent behavior of AM is as staccato bursts of development,
followed by wild leaps to unrelated concepts. AM no longer can
"permanently" record its interest in a certain concept.
So we conclude that the Worth facets are (i) not finely tuned, yet
(ii) provide important global information about the relative values
of concepts. If the Worth facets are completely disabled, the
rationality of AM's behavior hangs on the slender thread of "focus of
attention".
. SSSEC(How finely tuned is the Agenda?)
.RTEXSSSEC: SSSECNUM;
.COMMENT This assumes that expt number = sssec number;
.RTEXNO: SSSECNUM;
The top few candidates on the agenda always appear to be reasonable
(to me). If I work with the system, guiding it, I can cause it to
make a few discoveries it wouldn't otherwise make, and I can cause it
to make its typical ones much faster (about a factor of 2). Thus the
⊗4very⊗* top task is not always the best.
If AM randomly selects one of the top 20 or so tasks on the agenda
each time, what will happen to its behavior? Will it disintegrate,
slow down by a factor of 10, slow down slightly,...?
This experiment required only a few seconds to set up, but demanded a
familiarity with the LISP functions which make up AM's control
structure. At a certain point, AM asks for Best-task(Agenda).
Typically, the LISP function Best-task is defined as CAR -- i.e.,
pick the first member from the list of tasks. What I did was to
redefine Best-task as a function which randomly selected n from the
set {1,2,...,20}, and then returned the n⊗Ath⊗* member of the
job-list.
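The redefinition can be sketched as follows (a hypothetical modern reconstruction; the real change was a few lines of Interlisp replacing CAR with a random selector):

```python
# Sketch of the modified Best-task: instead of taking the front of the
# agenda (the original Best-task was simply CAR), pick uniformly at
# random among the top 20 candidates.

import random

def best_task(agenda, width=20, rng=random):
    """Return one of the top `width` tasks at random.  The agenda is
    assumed to be sorted with the highest-priority task first."""
    top = agenda[:width]                 # shorter agendas just use what's there
    return top[rng.randrange(len(top))]

agenda = [f"task-{i}" for i in range(50)]   # illustrative job names
chosen = best_task(agenda)                  # some task among task-0 .. task-19
```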
If you watch the top job on the agenda, it will take about 10 cycles
until AM chooses it. And yet there are many good, interesting,
worthwhile jobs sprinkled among the top 20 on the agenda, so AM's
performance is cut by merely a factor of 3, as far as cpu time per
given major discovery. Part of this better-than-20 behavior is due
to the fact that the 18⊗Ath⊗* best task had a much lower priority
rating than the top few, hence was allocated much less cpu time for
its quantum than the top task would have received. Whether it
succeeded or failed, it used up very little time. Since AM was
frequently working on a low-value task, it was unwilling to spend
much time or space on it. So the mean time allotted per task fell to
about 15 seconds (from the typical 30 secs). Thus, the "losers" were
dealt with quickly, so the detriment to cpu-time performance was
softened.
Yet AM is much less rational in its sequencing of tasks. A topic will
be dropped right in the middle, for a dozen cycles, then picked up
again. Often a "good" task will be chosen, having reasons all of
which were true 10 cycles ago -- and which are clearly superior to
those of the last 10 tasks. This is what is so annoying to human
onlookers.
Conclusion: Picking (on the average) the 10th-best candidate impedes
progress by a factor less than 10 (about a factor of 3), but it
dramatically degrades the "sensibleness" of AM's behavior, the
continuity of its actions. Humans place a big value on absolute
sensibleness, and believe that doing something silly 50% of the time
is ⊗4↓_much_↓⊗* worse than half as productive as always doing the
next most logical task.
Corollary: Having 20 multi-processors simultaneously execute the top
20 jobs will increase the rate of "big" discoveries, but not by a full
factor of 20.
Another experiment in this same vein was done, one which was designed
to be far more crippling to AM. Be-threshhold was held at 0 always,
so ⊗4any⊗* task which ever got proposed was kept forever on the
agenda, no matter how low its priority. The Best-task function was
modified so it randomly selected any member of the list of jobs. As
a final insult, the Worth facets of all the concepts were initialized
to 200 before starting AM.
Result: Many "explosive" tasks were chosen, and the number of new
concepts increased rapidly. As expected, most of these were real
"losers". There seemed no rationality to AM's sequence of actions,
and it was quite boring to watch it floundering so. The typical
length of the agenda was about 500, and AM's performance was "slowed"
by at least a couple orders of magnitude. A more subjective measure
of its "intelligence" would say that it totally collapsed under this
random scheme.
Conclusion: Having an unlimited number of processors simultaneously
execute all the jobs
on the agenda would increase the rate at which AM made big discoveries, at an
ever accelerating pace (since the length of the agenda would grow exponentially).
Having a uniprocessor ⊗4simulate⊗* such parallel processing would be a losing idea,
however. The truly "intelligent" behavior AM exhibits is its plausible
sequencing of tasks.
Let's dig inside the agenda scheme now. One of the highly-touted
ideas in AM is the attaching of reasons to the tasks on the agenda,
and using those reasons and their ratings to compute the overall
priority value assigned to each task. An experiment was done to
ascertain the amount of intelligence that was emanating from that
idea.
The global formula assigning a priority value to each job was
modified. We let it still be a function of the reasons for the job,
but we "trivialized" it: the priority of a job was computed as simply
the number of reasons it has (normalized by multiplying by 100, and
cut off at 1000).
This raised the new question of what to do if several jobs all have
the same priority. I suppose the answer is to execute them in
stack-order (most recent first)$$ Why? Because (i) it sounds right
intuitively to me,
(ii) this is akin to human focus of attention,
and mainly because (iii) this is what AM did
anyway, with no extra modification. $.
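Under that tie-breaking convention, the trivialized scheme might be sketched like this (a hypothetical modern reconstruction; the job names and dictionary representation are illustrative):

```python
# Sketch of the "trivialized" priority formula: a job's priority is just
# 100 x (number of reasons), capped at 1000; ties go to the most
# recently proposed job (stack order).

def priority(job):
    return min(1000, 100 * len(job["reasons"]))

def order_agenda(jobs):
    """`jobs` is listed in order of proposal (oldest first).  Sort by
    priority, breaking ties so more recently proposed jobs come first."""
    return sorted(enumerate(jobs),
                  key=lambda pair: (priority(pair[1]), pair[0]),
                  reverse=True)

jobs = [{"name": "fill examples of Sets",  "reasons": ["r1"]},
        {"name": "check examples of Bags", "reasons": ["r1", "r2"]},
        {"name": "generalize Equality",    "reasons": ["r3"]}]
agenda = [j["name"] for _, j in order_agenda(jobs)]
```

Note how the two one-reason jobs tie at priority 100, and the more recently proposed one ("generalize Equality") is scheduled first.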
Result: <<This experiment hasn't been performed yet!>
Conclusion:
. SSSEC(What if certain key concepts are eliminated?)
Feeling in a perverse mood one day, I eliminated the concept
"Equality" from AM, to see what it would then do. Equality was a key
concept, because AM discovered Numbers via the technique of
generalizing the relation "Equality" (exact equality of 2 given
structures, at all internal levels). What would happen if we
eliminate this path? Will AM rederive Equality? Will it get to
Cardinality via another route? Will it do some set-theoretic things?
Result: <<This expt (elim. equality) is not yet done!>
Similar experiments: eliminate various other specific concepts, add a
few specific ones, modify some.
. SSSEC(What if certain heuristics are tampered with?)
add/delete/modify certain heuristics
not done yet
. SSSEC(Can AM work in a new domainα: Plane Geometry?)
This big experiment has already been done, but needs to be written up
nicely.
The author added a new base of concepts to the ones already in AM.
Included were: Point, Line, Angle, Triangle, Equality of
pts/lines/angles/triangles.
Results: fairly good behavior. Derived congruence, similarity of
triangles. Derived the idea of timberline in several ways.
A geometric use for Goldbach's conjecture: any angle (0-180 degrees) can be built
up (to within 1 degree) as the sum of two angles of prime degrees
(<180).
.GEOEX: PAGE;
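The angle conjecture above can be checked mechanically. The sketch below is a modern reconstruction, not part of AM; it tests every integer angle from 3 through 180 degrees (angles below 3 degrees fall outside the claim, since the smallest sum of two primes is 2 + 2 = 4):

```python
# Check the geometric analogue of Goldbach's conjecture: (almost) any
# integer angle can be approximated to within 1 degree by the sum of
# two prime-degree angles, each prime being less than 180.

def is_prime(n):
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

PRIMES = [p for p in range(2, 180) if is_prime(p)]

def near_prime_sum(angle, tolerance=1):
    """Is `angle` within `tolerance` degrees of p + q, p and q prime?"""
    return any(abs(p + q - angle) <= tolerance
               for p in PRIMES for q in PRIMES)

# Every integer angle from 3 through 180 degrees passes the check.
holds = all(near_prime_sum(a) for a in range(3, 181))
```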